import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import bokeh
from bokeh.plotting import *
output_notebook()
plt.rcParams['figure.figsize'] = (15.0, 8.0)
The next cell imports the data from CSV files on the hard drive into a Python object called a DataFrame. This object is the workhorse of an excellent Python data processing library called Pandas, and has convenience methods to help deal with common data operations including dealing with timeseries, filtering, concatenation/merging, plotting, and much more. The read_csv function is a convenience function provided by Pandas to easily process a CSV file including parsing timestamp columns into appropriate types and setting up an "index" column. After we've loaded the data, we use the head() method to print out the first few lines of data in the set so that we can see the form that it has in Python.
Because of a glitch in this data collection, the data is separated into 2 CSV files, which we have to load separately. We use the concat() function to combine these two files into a single DataFrame object. Pandas is smart enough to realize that both of these DataFrames have the same columns and the same index, so it just combines them and ensures that the order of the data rows is correct.
def load_lemsens_csv(path):
    return pd.read_csv(path, infer_datetime_format=True, parse_dates=True,
                       index_col=0)
df1 = load_lemsens_csv("./14732.csv")
df2 = load_lemsens_csv("./lemur-2-peter_ardusat-experiment-15211.csv")
data = pd.concat((df1, df2))
data = data.tz_localize(tz='utc')
data.head()
The first thing we're going to do with the data is look at the distribution of the IR temperature variable. The easiest way to do this is to look at a histogram of the data. A histogram is a type of plot that shows a series of bars over the entire range of values found in a variable. The entire range is divided into a series of "bins," which are represented as bars on the output plot, with the left side of the bar being the lower limit of a given bin and the right being the upper limit. The height of the bar is the number of data points whose values fall within that bin's range. The resulting plot is an estimate of the distribution of the data variable.
sns.distplot(data.mlx90614_ir_temp)
One thing we immediately can see in the plot is that there are a few "extreme" values in this dataset (values that are very high or very low compared to the vast majority of the data). It is possible that these values have a physical origin - we know that the satellite's guidance system sometimes fails, which could mean that the sensor is no longer in a stable orientation looking down at Earth. A value taken with the sensor pointing into deep space could account for a very low value; a value taken looking at the Sun could account for a very high value. In addition, space is a harsh environment for electronics that is sometimes right on the edge of what this sensor is rated for, so it's also possible that these values are merely erroneous readings.
In either case, for the purposes of analyzing this data to try to extract some physical meaning, we would like to concentrate on/zoom in on the majority of data which seems "reasonable". To do this, we'll create a second histogram that uses a broad filtering function (ignoring values < -100 C or > 20 C, both values based on the appearance of the above plot) to only look at a subset of this data.
idx = (data.mlx90614_ir_temp > -100) & (data.mlx90614_ir_temp < 20)
sns.distplot(data.mlx90614_ir_temp[idx])
This plot shows the bulk of the data centered in the ~-25 to 0 C range with a negative skew. This means that when we try to look for linear correlations between variables and want to limit the influence that extreme values have on the relationship, we can probably look only at values in the ~-50 to 10 C range.
Interestingly, almost all the data are below 0 C. Given that the satellite is going over tropical regions, this is certainly not what we would expect if the IR sensor was seeing surface temperatures. We will investigate this discrepancy in the next section.
For the first step in our data analysis, we would like to look for correlations between columns in the dataset that might suggest a causal relationship between variables. There are a number of ways to look for correlations, but for this exercise, we are going to look for linear correlations between two variables (e.g. when variable 1 is in a certain range, variable 2 is usually in another predictable range). If a linear relationship between 2 variables exists, we should be able to fit a line to a scatter plot of the two variables using an equation of the form $mx + b = y$.
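As a refresher on what "fitting a line" means, here is a minimal sketch (on made-up data, not this dataset) using NumPy's polyfit to recover the slope $m$ and intercept $b$ from noisy observations:

```python
import numpy as np

# Hypothetical data: y = 2x + 1 plus a little Gaussian noise
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# polyfit with deg=1 fits a line; coefficients come back highest degree first
m, b = np.polyfit(x, y, deg=1)
print(m, b)  # should be close to 2.0 and 1.0
```

If a strong linear relationship exists between two of our columns, a fit like this should track the scatter plot closely; if not, the fitted line tells us very little.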
In order to visualize these relationships, we use convenience functions provided by the excellent Seaborn plotting library. This Python library is a wrapper around the venerable MatplotLib, which provides raw plotting utilities using an API very similar to that found in Matlab, a data processing platform and language common in engineering and science. Seaborn adds good looking default styles, setup functions, and common statistical operations to the basic functionality provided by Matplotlib. In particular, we will be using the jointplot function, which takes two variables, plots a 2-D scatter plot of the variables, adds a histogram of the distribution of values for each variable, and calculates a Pearson correlation coefficient, which is basically a value describing how strong of a linear correlation exists between the variables. It ranges from -1 to 1, where -1 is a perfect negative correlation, 1 is a perfect positive correlation, and 0 is no correlation.
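The Pearson coefficient that jointplot reports can also be computed directly. A small sketch on hypothetical data, using NumPy's corrcoef, shows the two extremes of the scale:

```python
import numpy as np

# Hypothetical data illustrating perfect positive and negative correlations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is Pearson's r
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]  # perfectly linear -> 1.0
r_neg = np.corrcoef(x, -3 * x)[0, 1]     # perfectly anti-linear -> -1.0
print(r_pos, r_neg)
```

Real data always lands somewhere in between, and values near 0 (like the one we find below) indicate no linear relationship.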
One hypothesis we would like to examine is whether there is a relationship between the observed IR temp and the Eclipse Depth variable. This variable describes how deep the satellite is into the Earth's eclipse - i.e. whether the Earth is blocking the Sun from the satellite's point of view. It is measured in degrees, with negative values indicating that the Sun is partially or fully eclipsed by the Earth, meaning the satellite is flying over nighttime regions of the Earth. We would like to see if the observed temperatures are strongly correlated with whether the satellite is over the day or night side of the Earth.
sns.jointplot('eclipse_depth', 'mlx90614_ir_temp', data, size=12, ylim=(-50, 5))
As we can see, there is virtually no correlation between these two variables (the plot is essentially random - if there was a correlation, we should be able to draw a single line that more or less "follows" the data). The Pearson's coefficient is -0.0032, which is very close to 0 (no correlation).
We would like to be able to see if there is a relationship between the location on Earth where it was collected and the data value. Since the location on Earth is already a 2-D value (latitude and longitude), we will need to use a third dimension to visualize the actual temperature. There are a number of ways to do this, but in this case we've chosen to use the color of a plotted point to represent the temperature.
In order to better visualize the data, we've generated a map for each day that we have data for, as well as a summary map (at the end of the series) that has all of the data points plotted on a single map.
from mpl_toolkits.basemap import Basemap
import numpy as np
def add_map(fig, title, data, total_maps, map_num):
    ax = fig.add_subplot(total_maps, 1, map_num + 1)
    ax.set_title(title)
    map_ = Basemap(projection="merc", llcrnrlat=-10, llcrnrlon=-180, urcrnrlat=10, urcrnrlon=180,
                   lat_ts=0, resolution=None)
    map_.shadedrelief()
    # Because of the way that Basemap works, we need to sort by longitude first
    # (see https://github.com/matplotlib/basemap/issues/214)
    lon_sorted = data.sort_values('long')
    # Filter out extreme readings so that they don't dominate the color scale
    extreme_vals = (lon_sorted.mlx90614_ir_temp > 10) | (lon_sorted.mlx90614_ir_temp < -35)
    lon_sorted = lon_sorted[~extreme_vals]
    temps = map_.scatter(lon_sorted.long.values, lon_sorted.lat.values, latlon=True,
                         c=lon_sorted.mlx90614_ir_temp.values, cmap=plt.cm.jet)
    color_bar = map_.colorbar(temps, "bottom", size="5%", pad="2%")
fig = plt.figure(1, figsize=(30, 30))
def filter_by_date(data, date):
    """Filter a DataFrame with a DatetimeIndex down to a single date"""
    mask = data.index.date == date
    return data[mask]
dates = np.unique(data.index.date)
num_plots = len(dates) + 1
for i, date in enumerate(dates):
    add_map(fig, "%s" % date, filter_by_date(data, date), num_plots, i)
# The summary map goes in the last subplot slot (add_map adds 1 to map_num)
add_map(fig, "Full Dataset", data, num_plots, num_plots - 1)
plt.show()
plt.show()
Looking at these plots we can see some regions that consistently have cooler temperatures (over the Amazon rainforest in South America, and Central Africa, for example), and some areas with warmer temperatures (the Central Pacific Ocean). We can also watch some areas warm or cool (Singapore appears noticeably warmer on 2016-06-21 than it was on 2016-06-19, for example). One hypothesis we could draw from this that we would like to test is whether or not this was caused by changes in weather.
We would like to see if weather can explain some of the variation that we see in this data. Our hypothesis is that water vapor in the atmosphere (clouds and humidity) absorbs the bulk of IR radiation, so when we look through the atmosphere with the IR sensor, we expect to see variations based on the water vapor present. There are a number of online weather sources, but ForecastIO/DarkSky has an easy-to-use API with a nice Python wrapper that makes querying for weather data easy. A free ForecastIO account gives access to 1000 queries per day.
The following code loops through each point in the dataset and issues a request (via the load_forecast function) to the Dark Sky API. The cloud cover, dew point, and humidity fields are then pulled from this record and added to the dataset for analysis.
import forecastio
def add_weather_by_row(row, cache=None):
    res = forecastio.load_forecast(FORECAST_IO_API_KEY, row.lat, row.long, time=row.name)
    wx_keys = [
        "cloudCover",
        "dewPoint",
        "humidity",
        "ozone",
        "precipAccumulation",
        "precipIntensity",
        "precipProbability",
        "precipType",
        "pressure",
        "temperature",
        "visibility",
        "windBearing",
        "windSpeed",
    ]
    if len(res.hourly().data) > 0:
        # Prefer the hourly data point matching this row's hour; fall back to the first
        point = next((p for p in res.hourly().data if p.time.hour == row.name.hour),
                     res.hourly().data[0])
        weather_data = {}
        for k in wx_keys:
            try:
                weather_data[k] = getattr(point, k)
            except forecastio.utils.PropertyUnavailable:
                weather_data[k] = np.NaN
        weather_data = pd.Series(weather_data)
    else:
        weather_data = pd.Series({k: np.NaN for k in wx_keys})
    new_row = pd.concat((row, weather_data))
    if cache is not None:
        cache[new_row.name] = new_row
    return new_row
data_cache = {}
data_with_wx = data.apply(lambda x: add_weather_by_row(x, data_cache), axis=1)
# Save this data so that we don't have to hit the API again next time we run this code
data_with_wx.to_csv("./dataset.csv", header=True)
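On later runs, we can reload this saved CSV instead of hitting the API again. A sketch of what that might look like (the load_cached_dataset helper is illustrative, not part of the pipeline above); note that parse_dates and index_col restore the timestamp index just as in our original loader:

```python
import pandas as pd

def load_cached_dataset(path="./dataset.csv"):
    # Re-parse the first column back into a DatetimeIndex, mirroring load_lemsens_csv
    return pd.read_csv(path, parse_dates=True, index_col=0)
```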
Here we look at a joint plot of the relationship between cloudCover and IR temp. We filter out extreme values, which are likely due to some kind of noise in the data-gathering process, so that we can better see the "normal" relationship between the variables.
extreme_vals = (data_with_wx.mlx90614_ir_temp > 10) | (data_with_wx.mlx90614_ir_temp < -55)
sns.jointplot('cloudCover', 'mlx90614_ir_temp', data_with_wx[~extreme_vals], size=12, ylim=(-50, 5), kind="reg")
There is still a fair amount of noise in this plot, but at least we're seeing a clear relationship between the variables (and, based on the p-value, one very unlikely to be just random noise). This is encouraging - it would appear that we can detect some degree of cloudiness with the IR sensor! The relationship is negative, which means that the higher the cloud cover percentage, the lower the observed IR temperature. One possible explanation is that with greater cloud cover, the IR radiation is being blocked by the tops of clouds, which are higher in the atmosphere and thus colder than the warm air near the surface.
Another interesting feature of this dataset is that the distribution of cloudCover data appears to be multi-modal - there are a few values (0, ~0.33, 0.66, 1.0) that have a large number of observations relative to the values in between. Can you think of any reason this might happen, based on what you know about weather stations and weather observation?
extreme_vals = (data_with_wx.mlx90614_ir_temp > 10) | (data_with_wx.mlx90614_ir_temp < -55)
sns.jointplot('humidity', 'mlx90614_ir_temp', data_with_wx[~extreme_vals], size=12, ylim=(-50, 5), kind="reg")
There appears to be a reasonable correlation between humidity and IR temperature as well, although the relationship is a bit weaker than for cloud cover. This again fits our hypothesis - higher humidity means more water vapor in the atmosphere, which means the IR sensor cannot see as far down into the atmosphere and is thus seeing temperatures from higher up (and colder).
extreme_vals = (data_with_wx.mlx90614_ir_temp > 10) | (data_with_wx.mlx90614_ir_temp < -55)
sns.jointplot('dewPoint', 'mlx90614_ir_temp', data_with_wx[~extreme_vals], size=12, ylim=(-50, 5), kind="reg")
extreme_vals = (data_with_wx.mlx90614_ir_temp > 10) | (data_with_wx.mlx90614_ir_temp < -55)
sns.jointplot('temperature', 'mlx90614_ir_temp', data_with_wx[~extreme_vals], size=12, ylim=(-50, 5), kind="reg")
There is almost no correlation between surface temperature and our observed IR temperatures. In one sense, this is unfortunate - for many earth-observing experiments, it would be nice to be able to see surface temperatures with the IR detector. However, it fits the hypothesis that the primary driver behind the IR values we're seeing is water vapor in the atmosphere.
This is just scratching the surface of analysis for this dataset. We have loaded in weather data for each data point, and shown some basic variables that appear to have some relationship with our measured values.
Can you build a linear model that predicts e.g. cloud cover or surface humidity based on observed IR temperature? How accurate is your model?
Look at a specific area on Earth that shows changes in IR temperature over the course of this study. Does the change in observed IR temperature match up with changes in weather?
The relationships described above are certainly not perfect fits (they don't explain all of the data). Brainstorm some reasons why the data might not match up exactly.
The field of view of this IR sensor is 90 degrees. Based on the altitude of the satellite, roughly how big a patch of Earth can the sensor see at each data point? Could you improve the dataset by loading in more weather data rather than just relying on a single point from the center of this circle?
Based on what you learned from this experiment, what would you want to look at for your next experiment?